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Abstract 

Background: In many scientific domains, it is becoming increasingly common to collect high-dimensional data 
sets, often with an exploratory aim, to generate new and relevant hypotheses. The exploratory perspective often 
makes statistically guided visualization methods, such as Principal Component Analysis (PCA), the methods of 
choice. However, the clarity of the obtained visualizations, and thereby the potential to use them to formulate 
relevant hypotheses, may be confounded by the presence of the many non-informative variables. For microarray 
data, more easily interpretable visualizations are often obtained by filtering the variable set, for example by 
removing the variables with the smallest variances or by only including the variables most highly related to a 
specific response. The resulting visualization may depend heavily on the inclusion criterion, that is, effectively the 
number of retained variables. To our knowledge, there exists no objective method for determining the optimal 
inclusion criterion in the context of visualization. 

Results: We present the projection score, which is a straightforward, intuitively appealing measure of the 
informativeness of a variable subset with respect to PCA visualization. This measure can be universally applied to 
find suitable inclusion criteria for any type of variable filtering. We apply the presented measure to find optimal 
variable subsets for different filtering methods in both microarray data sets and synthetic data sets. We note also 
that the projection score can be applied in general contexts, to compare the informativeness of any variable 
subsets with respect to visualization by PCA. 

Conclusions: We conclude that the projection score provides an easily interpretable and universally applicable 
measure of the informativeness of a variable subset with respect to visualization by PCA, that can be used to 
systematically find the most interpretable PCA visualization in practical exploratory analysis. 



Background relevant biological knowledge from the observed data, and 

High-dimensional data sets, where the observed variables the performance of many machine learning algorithms 

often by far outnumber the samples, are becoming may drastically improve if the number of variables is 

increasingly prevalent in many scientific domains. As an reduced before or in conjunction with the application of 

example, microarrays are used extensively to measure a the algorithm. 

variety of genomic attributes such as gene expression and The exploratory perspective on high-dimensional data 

DNA copy numbers. A peculiar feature of many high- means that visualization methods, providing a graphical 

dimensional data sets is that they are often collected with representation of the data, can be of great assistance. One 

an exploratory aim, without a specific hypothesis in mind. of the most commonly used visualization methods for 

This means, among other things, that a data set may con- microarray data is Principal Component Analysis (PCA) 

tain many variables that are not really informative, while [1-4] which provides coupled, low- dimensional representa- 

the informative structure is contained in a small subset of tions of the samples and variables in a data set. PCA is 

the variables. The presence of non- informative variables applicable also to very high-dimensional data sets, but the 

can have detrimental effects on the possibility to extract resulting visualization can be severely confounded and the 

truly informative structure can be concealed by the pre- 

* Correspondence: fontes@maths.lth.se sence of a large number of non-informative variables. 
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improve the interpretability of PCA visualizations. In the 
microarray context, one commonly used approach to 
reduce the dimensionality is to simply exclude the vari- 
ables with the lowest variances before applying PCA [5-8], 
implicitly assuming that they do not contain any useful 
information concerning the samples in the data set. Such 
variable filtering often provides more informative and 
easily interpretable PCA representations. Other ways to 
reduce the dimension before PCA are also common, such 
as excluding the variables with the lowest signal intensities 
[9] or including only those variables that are highly related 
to a specific response [6,7,10]. 

Naturally, the resulting visualization may be highly 
dependent on the number of included variables and, 
hence, on the variance or significance threshold used as 
the inclusion criterion. To our knowledge, there exist no 
well-motivated, objective criteria that are useful for 
obtaining good stopping rules in such variable filterings, 
and therefore most studies apply some kind of ad-hoc 
criterion. In this paper, we present the projection score, 
which is a straightforward, intuitively appealing measure 
of the informativeness of a variable subset with respect 
to visualization. Intuitively, the projection score com- 
pares the variance captured by a pre-defined set of prin- 
cipal components to the expected value under the 
assumption that the variables are independent, that is, 
that there is no non-random structure in the data. The 
projection score can be universally applied to provide a 
stopping rule for a variable filtering, and hence it can be 
used to systematically find the most interpretable PCA 
visualizations in practical exploratory analysis applica- 
tions, for example in microarray data analysis. More- 
over, as we will see in this paper, the projection score 
can be applied in more general contexts than those indi- 
cated above, to compare the informativeness of any vari- 
able subsets with respect to visualization by PCA. 

Related work 

Many variable selection techniques have been devised for 
use with supervised learning algorithms, such as classifiers 
or predictors. In the supervised learning context, measur- 
ing the informativeness of a variable subset is fairly 
straightforward. Probably the most common approach is 
to use some kind of cross-validation scheme to train the 
classifier on the samples in a training set and define the 
informativeness of the variable subset based on the perfor- 
mance of the obtained classifier in an independent test set 
[11]. For unsupervised learning methods, such as visualiza- 
tion and clustering, which do not make use of any class 
labels or other response variables, this approach is not 
applicable. Variable selection for model-based clustering, 
where the data are assumed to come from a mixture of 
several subpopulations, has been considered [12,13]. 
In this case, the value of the objective function depends on 



how well the data comply with the assumed mixture 
model. For visualization purposes however, it is less clear 
what constitutes an optimal result. The projection score 
defined in this paper measures the informativeness of a 
variable subset based on the variance accounted for by a 
subset of the principal components of the variable subset, 
without using any side information such as the class mem- 
berships of samples. 

Variable selection in the context of PCA has been con- 
sidered for many years, but often with the aim of approxi- 
mating the original data set as well as possible with only a 
subset of the variables. Moreover, the proposed methods 
do not in general address the problem of finding a suitable 
stopping criteria for variable filtering. The potential advan- 
tage of sparse components, in terms of interpretability, 
were recognized already in the 1970s [14,15]. One com- 
mon approach to variable selection for PCA has been to 
fix the desired number of variables in advance and search 
for the optimal variable subset of this pre-selected cardin- 
ality [4,16-19]. To comply with the original goals of PCA, 
the optimal variable subset is often defined as the one con- 
taining the largest amount of variance, or providing the 
best approximation to the original data set. Hence, one 
searches for a subset of the variables containing essentially 
the same information as the entire variable set. This goal 
is common also to many of the recently proposed sparse 
PCA methods [20-22]. These methods often take their 
starting point in one of the optimality properties of ordin- 
ary PCA and introduce a penalty (often a lasso [23] or 
elastic net [24] penalty) to limit the number of non-zero 
principal component weights. The algorithms are in gen- 
eral developed for a fixed penalty (that is, a fixed number 
of non-zero weights) and the optimal penalty is deter- 
mined by cross-validation, trying to approximate a test set 
as well as possible, or by ad-hoc methods. In contrast to 
these methods, the main objective of the approach pro- 
posed in this paper is to compare subsets of different sizes, 
such as, for example, subsets obtained by variance filtering 
with different inclusion thresholds. Our aim is not expli- 
citly to approximate the original data set as well as possi- 
ble, but rather to find informative structures which may 
not be apparent by considering the entire data set. By fil- 
tering the variable set with respect to a specific factor 
(such as variance or the association with a response) we 
obtain sparse principal components where the included 
variables are all related to the same factor. These may be 
easier to interpret than general sparse principal compo- 
nents. With our approach, it is also possible to compare 
the informativeness contained in different factors, by filter- 
ing with respect to the association with each one of them 
and comparing the best projection scores obtained for 
each factor. 

Another related approach for variable selection, or 
variable clustering, using PCA is the gene shaving 
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procedure [25,26]. This procedure starts by computing 
the first principal component of the data. In each step, 
the variables with the smallest loadings are shaved off, 
and a new principal component is computed from the 
remaining variables. This yields a nested sequence of 
variable subsets. For each subset, a quality measure is 
computed as the ratio between the variance of the mean 
value of the expression levels of the genes in the subset 
and the total variance of the subset. Then, this quality 
measure is compared to what would be expected from 
random data, using the gap statistic (that is, the differ- 
ence between the observed and the expected value). The 
variable subset with the highest value of the gap statistic 
is considered to be the optimal variable subset (called a 
gene cluster). We will show that one way to use the 
projection score is to obtain another quality measure of 
the variable subsets that can be used in such a shaving 
approach, and that the optimal variable set is not neces- 
sarily the same as with the method described in [25,26]. 

Biclustering methods based on sparse matrix decompo- 
sitions have been proposed by several authors (see for 
example [27] and references therein). Here, the aim is to 
find a subset of the variables which are correlated across a 
subset of the samples and hence sparsity is induced for 
both samples and variables. The biclustering methods use 
different measures to evaluate subsets of a given data set. 
For example, in [28] the quality of a submatrix is defined 
based on the average expression value in the submatrix. 
Biclustering methods have been shown to be useful for 
finding informative patterns which are not necessarily pre- 
sent across the entire data set. However, they are not 
explicitly optimized for visualization, and, as the sparse 
PCA methods, they are not used to find stopping criteria 
for variable filtering. 

Another variable selection criterion for PCA was 
described in [29]. For each variable, the authors compute 
the difference between the entropy of the entire data set 
and the entropy of the data set with the variable removed. 
This is used as a measure of the contribution of each of 
the variables, and the variables are ranked according to 
their contributions. The optimal variable subset is consid- 
ered to consist of all variables whose contributions are 
more than one standard deviation higher than the mean 
value of all contributions. In contrast to this method, we 
compute the projection score for a large number of vari- 
able subsets and select the one with the highest value as 
the most informative in a visualization context. 

Results and Discussion 

In this section, we will first apply different filtering tech- 
niques, with varying inclusion criteria, to generate a col- 
lection of variable subsets from each of three microarray 
data sets. The projection score will be applied to find the 



most informative variable subset in each example. It is 
important to note that the informativeness is measured 
with respect to unsupervised, exploratory analysis by 
PCA, where the aim often is generation of new hypoth- 
eses rather than validation of existing hypotheses. This 
means that we could use the projection score as a quality 
measure even in the absence of any side information 
about, for example, sample groups. It can also be used to 
quantify and compare the informativeness of variable 
subsets obtained by supervised methods, for example 
subsets consisting of the variables most highly related to 
a given response. Regardless of how the variable subsets 
are obtained, the projection score evaluates their infor- 
mativeness from an unsupervised perspective, based on 
the variance contained in a pre-defined subset of the 
principal components of the respective variable subsets. 

In all cases, when we vary the inclusion criterion (in 
most of our examples, a single parameter), the projection 
score traces out a reasonably smooth curve, often with a 
clear maximum, which means that it is reasonable to say 
that there is indeed a maximally informative subset in the 
proposed sense. In this article, this curve will be referred 
to as the projection score curve for the filtering parameter. 
In the first example, we filter the variable set by variance 
and find an informative subset of variables with high var- 
iances, providing a graphical sample representation which 
is more easily interpretable than the one obtained from all 
variables. In the second example, we filter the variable set 
by the association with given responses, and show that the 
optimal projection score and the shape of the projection 
score curve obtained from filtering with respect to a cer- 
tain response capture the overall evidence in the data for a 
significant association with that response. We also show 
the results from a combined variance and response-related 
filtering. In the third example, we apply a shaving proce- 
dure to generate variable subsets, and evaluate the 
obtained subsets by their projection scores. 

Then, we validate the use of the projection score with 
synthetic examples, where we show that the variable sub- 
set with the highest projection score is often the one con- 
taining the non-random structure in the data. Finally, we 
discuss some issues regarding the estimation of the pro- 
jection score and warn against a potential pitfall. 

In all examples, we use the projection score to com- 
pare the informativeness of variable subsets with respect 
to visualization. Assume that we are given a data set 
with N samples and p variables, represented by a rank-r 
matrix X e R pxN , and a collection of M variable subsets 
R m <= {1, p} for m = 1, M. We define selection 
functions (p m in such a way that 0 m (X) e [Rl#m|xN con- 
sists only of the rows in X with indices in R m . To com- 
pute the projection score, we must also select an S <= {1, 
r}, essentially representing the number of degrees of 
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freedom we expect in the data. Letting A x = (Ai, A r ) 
denote the vector of non-zero singular values of X in 
decreasing order, the fraction of the variance in X which 
is accounted for by the principal components with 
indices in S can be computed as 



oil (A x , S) 



J2keS 

n 



i 1 



The projection score of R m is then defined as 

r(0 m (X),S,P MX) ) = MA 0m(X ),S)) 1/2 

- E ^ m( x) [(^(A 0m(X ),S)) 1/2 ]. 

Here, 7^ m (x) denotes the inferred distribution of (p m (X) 
under the assumption of independence among the origi- 
nal samples and variables in X. We compute the projec- 
tion score for each of the M variable subsets, and the 
subset with the largest projection score is considered 
the most informative variable subset for visualization. 
For further details, see the Methods section. To obtain a 
visualization which reflects the correlation structure 
between the variables of the data set instead of the indi- 
vidual variances, we consistently extract the singular 
values from standardized data matrices, where each vari- 
able is mean-centered and scaled to unit variance. This 
standardization is indeed commonly used [30,31]. If we 
had not standardized the data, the correlations between 
the variables would be less influential when computing 
the principal components, and the variances of the indi- 
vidual variables could potentially have a very large 
impact. Since we define the projection score by compar- 
ing the observed data to data generated under a null 
model defined by assuming independence between the 
variables, we may argue that the non-random structure 
that is detected with the resulting score is that corre- 
sponding to correlations between variables and that 
therefore, the standardized data are better suited for our 
purposes. Note that for the variance- and response- 
related filterings, the variable rankings are extracted 
from the original, unstandardized data. 

Variance filtering of a lung cancer data set 

We first use the projection score to find particularly 
informative variable subsets among those obtained by 
applying variance filters to the lung cancer data set stu- 
died in [27,32]. The data set contains gene expression 
measurements for 12,625 genes in 56 subjects. The sub- 
jects belong to four groups: Normal (N = 17), Pulmonary 
carcinoid tumors (N = 20), Small cell carcinoma (N = 6) 
or Colon metastases (N = 13). For a given set of variance 
thresholds {0 m }^ =1 (defined for the original, unstandar- 
dized data), we define R m as the collection of variables 



with variance exceeding 6 m . The variance thresholds are 
specified as fractions of the highest individual variable 
variance in the data set, meaning that the variables 
included in the variable subset R m are those with var- 
iances exceeding 6 m ♦ v<2r max where v^r max denotes the 
maximal variance among the variables in the data set. 
We choose S = {1, 2, 3}, that is, we search for a variable 
subset providing an informative three-dimensional sam- 
ple representation, which is reasonable in a visualization 
context. Figure 1(a) shows the three-dimensional sample 
representation obtained by applying PCA only to the 
variables in the most informative subset. The representa- 
tion based on the most informative subset is considerably 
more easily interpretable than the representation based 
on the entire set of variables, which is shown in Figure 1 
(b). In the representation based on the most informative 
variable subset, the pulmonary carcinoid tumors (shown 
in red) appear to cluster into several subgroups, an effect 
which was also noted in [27]. Figure 1(c) shows the pro- 
jection score curve for the filtering parameter 6. Very 
small variable subsets, corresponding to high values of 
the variance threshold, do not support the chosen S (see 
Methods section). The projection score attains its maxi- 
mal value of r = 0.534 for a variance threshold of 7.76% 
of the maximal variance, corresponding to a variable sub- 
set consisting of the 591 variables with highest variances. 
In Figure 1(d), we show a heatmap for the expression 
matrix corresponding to the most informative variable 
subset. Clearly, the variables with the highest variances 
contain much information about the four sample groups, 
which is not surprising. 

Response-related filtering of the NCI-60 data set 

In this section, we construct (p m (X) to consist of the 
variables which are most highly related to a given 
response variable. In the studied examples the response 
variable indicates the partition of the samples into dif- 
ferent groups. Given such a partition, we calculate the 
F-statistic, contrasting all these groups, for each variable. 
For a given set of significance thresholds {(x m }m=v a ^ 
variables which are significantly related to the response 
at the level a m (that is, all variables with a j?-value 
below a m ) are included in (p m (X). For each randomized 

data set X* used to estimate E^ m(x) [(a 2 (A 0m ( X ), S)) 1/2 J 

(see the Methods section) we define the significance 
thresholds a* in such a way that the resulting variable 
subsets have the same cardinalities as the corresponding 
subsets from the original data set. The choice of S is 
guided by the underlying test statistic. When we con- 
trast only two groups, we do not expect the optimal 
variable subset to support more than a one-dimensional 
sample configuration, and therefore we choose S = {1} 
in this case. When contrasting more than two groups, 
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Figure 1 Variance filtering of the lung cancer data, (a) The sample representation obtained by applying PCA to the most informative variable 
subset obtained by variance filtering, containing 591 genes. The different colors indicate different cancer subtypes, (b) The sample representation 
obtained by applying PCA to the entire variable set (12,625 variables), (c) The projection score as a function of the variance threshold 9 (fraction 
of maximal variance) used for inclusion, (d) Heatmap of the most informative variable subset, that is, the one used to create the sample 
representation in (a). In panels (a) and (b), in order to obtain more easily interpretable plots, we joined the closest neighbors among the samples 
with line segments. The distance between two samples is defined by the Euclidean distance in the space spanned by all the remaining 
variables. The hierarchical clusterings in panel (d) are created using Euclidean distances and average linkage. The figures in (a), (b) and (d) were 
generated using Qlucore Omics Explorer 2.2 (Qlucore AB, Lund, Sweden). 



the choice of S must be guided by other criteria. This is 
because the variables with the highest F-score may in 
this case very well characterize many different sample 
groups, not all of which can simultaneously be accu- 
rately visualized in low dimension. 

The NCI-60 data set [33] contains expression measure- 
ments of 9,706 genes in 63 cell lines from nine different 
types of cancers. We first filter the variable set with 
respect to the association with the partition defined by all 
the nine cancer types, using S = {1, 2, 3}. This gives a most 
informative subset consisting of 482 variables, with a pro- 
jection score r = 0.351. The resulting sample representa- 
tion, obtained by applying PCA to the most informative 
variable subset, is shown in Figure 2(a). The representation 
based on all variables is shown in Figure 2(b) and the 



projection score is shown in Figure 2(c) as a function of 
log 10 (a). Figure 2(d) shows the ^-value distribution, which 
indicates that there are indeed variables which are truly 
significantly associated to the response. Figure 2(e) shows 
a heatmap for the most informative variable subset (the 
same as in Figure 2(a)) where it can be seen that the sam- 
ples are reasonably well clustered according to cancer type 
based on these 482 variables only. 

Next, we set out to study how informative the contrasts 
between one cancer type and all the others are. We filter 
the variable set using the association with the contrast of 
one cancer type vs the rest, using S = {1}. Figure 3(a) shows 
some of the projection score curves. First, we can note that 
the range of ^-values, as well as the range of obtained pro- 
jection scores, are highly different for the different 
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Figure 2 Response-related filtering of the NCI-60 data, (a) The sample representation obtained by applying PCA to the most informative 
variable subset (482 variables) obtained by filtering with respect to the F-value for contrasting all nine cancer types. The different colors indicate 
different cancer types, (b) The sample representation obtained by applying PCA to the entire variable set (9,706 variables), (c) The projection 
score as a function of log 10 (a), where a is the p-value threshold used for inclusion, (d) The p-value distribution for all variables, indicating that 
there are truly significantly differentially expressed genes with respect to the contrast, (e) Heatmap of the most informative variable subset, that 
is, the one used to create the sample representation in (a). In panels (a) and (b), in order to obtain more easily interpretable plots, we joined the 
closest neighbors among the samples with line segments. The distance between two samples is defined by the Euclidean distance in the space 
spanned by all the remaining variables. The hierarchical clusterings in panel (e) are created using Euclidean distances and average linkage. The 
figures in (a), (b) and (e) were generated using Qlucore Omics Explorer 2.2 (Qlucore AB, Lund, Sweden). 
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Figure 3 Response-related filtering of the NCI-60 data - continued, (a) The projection score as a function of log 10 (a), where a is the p-value 
threshold used for inclusion when contrasting each of four cancer types with all the other eight types in the NCI60 data set. For the Melanoma, 
Leukemia and Renal types, small groups of variables form the most informative subsets. For the NSCLC type, the entire variable collection is the 
most informative variable subset, (b) The p-value distribution for all variables when contrasting NSCLC with all other groups, indicating that there 
are essentially no truly significantly differentially expressed genes for this contrast, (c) The p-value distribution for all variables when contrasting 
Melanoma with all other groups. 



contrasts. The highest projection scores in the respective 
cases are 0.416 (for the Melanoma vs the rest contrast), 
0.348 (Leukemia), 0.281 (Renal) and 0.164 (NSCLC). 
Apparently, for each of the Melanoma, Leukemia and 
Renal contrasts, a small subset of the variables related to 
the respective response contains a lot of non-random infor- 
mation. However, for the NSCLC contrast the full variable 
set (corresponding to log 10 a = 0) is the most informative. 
This can be understood from Figure 3(b), which shows a 
histogram over the ^-values obtained from the F-test con- 
trasting the NSCLC group with the rest of the samples. 
The /^-values are essentially uniformly distributed, indicat- 
ing that there are no truly differentially expressed genes in 
this case. Hence, in the filtering process we do not unravel 
any non-random structure, but only remove the variables 
which are informative in other respects. Figure 3(c) shows 
the ^-value distribution for the Melanoma contrast. In this 



case, there are indeed some differentially expressed genes, 
which means that in the filtering process, we purify this 
signal and are left with an informative set of variables. The 
projection scores obtained from the different contrasts are 
consistent with Figure 2(a), in the sense that the highest 
projection scores are obtained from the contrasts corre- 
sponding to the cancer types which form the most appar- 
ent clusters in this sample representation, that is, the 
Melanoma samples and the Leukemia samples. The 
NSCLC samples do not form a tight cluster and are 
not particularly deviating from the rest of the samples in 
Figure 2(a). 

Combined variance and response-related filtering of the 
NCI-60 data set 

In some cases, the variable set is first filtered by variance 
before a statistical test is applied. Removing supposedly 
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non-informative variables in this way may be beneficial in 
terms of statistical power, since if the number of per- 
formed tests is high, we need to do a rather heavy correc- 
tion for multiple comparisons. Also with the approach 
proposed in this paper, we can combine different filtering 
procedures to find a variable subset which is informative 
from more than one perspective. Here, we show how to 
combine variance filtering with response-related filtering 
for the NCI-60 data set. In this way, we exclude all vari- 
ables that obtain small ^-values from the F-test mostly 
due to their low variances. As before, we define a collec- 
tion of variance thresholds 6 m and a number of signifi- 
cance thresholds a m) for the F-test contrasting all nine 
cancer types. We choose S = {1, 2, 3} as before. Now, the 
projection score can be represented as a surface, shown 
in Figure 4(a), for varying values of the variance and p- 
value thresholds. The curves traced out for log 10 (a) = 0 
and 0 = 0 correspond to the projection score curves for 
the individual variance and response-related filterings, 
respectively. As can be seen in Figure 4(a), we get the 
maximal projection score for a variable subset obtained 
by a combination of variance filtering and response- 
related filtering. This subset includes all variables with a 
variance exceeding 5.8% of the maximal variance, and 
with a j?-value below 3.6 ♦ 10" 5 , in total 263 variables. The 
maximal projection score is r = 0.416. This can be com- 
pared to the purely response-related filtering, which gave 
a maximal projection score of r = 0.351 by including the 
482 variables with ^-values below 2.1 • 10~ 5 . Figure 4(b) 
shows the sample configuration obtained by applying 



PCA to the most informative variable subset from the 
combined filtering. 

Shaving of a leukemia data set 

Here, we will use the projection score to evaluate the 
informativeness of variable subsets obtained by the shav- 
ing procedure described in [25]. Briefly, for a given frac- 
tion tt s (0, 1), we define a function f : 2 {1 ""' p} -> 2 {w} 
by letting £(R) consist of the fraction 1 - n of the vari- 
ables in R having the highest loadings in the first princi- 
pal component of the data set consisting of the variables 
in R. The selection functions q> m are then defined by let- 
ting (p m 0Q contain all variables in fj^^J^ 1, ■ ■ ■ 'P}\ 

m 

We use Ji = 0.02, hence in each step shaving off 2% of 
the variables, continuing until only one variable remains. 
We compute the projection score for each of the vari- 
able subsets, as well as the gap statistic proposed in [25] 
(using the signed variables, see Methods section). 

We apply the described method to the leukemia data 
set studied in [34]. The data set contains gene expres- 
sion measurements from 7,129 genes in 38 samples 
with either AML (N = 11) or ALL (N = 27). Comput- 
ing the projection score and the gap statistic for the 
variable subsets obtained by shaving gives optimal vari- 
able subsets containing 691 variables (projection score) 
and 336 variables (gap statistic), respectively. Figure 5 
(a) shows the two informativeness measures as func- 
tions of the variable subset cardinality. Both curves are 
smooth and have clear extreme points. Figures 5(b) 
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(a) (b) 

Figure 4 Combined variance and response-related filtering of the NCI-60 data, (a) The projection score for various choices of the inclusion 
criteria for the variance (6, fraction of max variance) and the p-value from an F-test contrasting all nine cancer types (log 10 (a), where a is the p- 
value threshold). The optimal projection score is obtained by combining the two filtering procedures, (b) The sample representation obtained by 
applying PCA to the most informative variable subset. In panel (b), in order to obtain a more easily interpretable plot, we joined the closest 
neighbors among the samples with line segments. The distance between two samples is defined by the Euclidean distance in the space 
spanned by all the remaining variables. The figure in (b) was generated using Qlucore Omics Explorer 2.2 (Qlucore AB, Lund, Sweden). 
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(b) (c) 

Figure 5 Gene shaving of the leukemia data, (a) The projection score (black) and the gene shaving gap statistic (red) as functions of the 
cardinality of the variable subset, (b) Heatmap of the most informative variable subset by the projection score, consisting of 691 variables. Red - 
ALL, green - AML (c) Heatmap of the most informative variable subset by the gene shaving gap statistic, consisting of 336 variables. The 
hierarchical clusterings in panels (b) and (c) are created using Euclidean distances and average linkage. The figures in (b) and (c) were generated 
using Qlucore Omics Explorer 2.2 (Qlucore AB, Lund, Sweden). 



and 5(c) show heatmaps for the optimal variable sub- 
sets from the two methods. As can be seen in these 
figures, the two subsets contain much the same infor- 
mation about the samples in the data set. The variable 
subset selected by the projection score is larger than 
the one selected by the gap statistic. In some situa- 
tions, a small number of included variables may be 



Table 1 Sparsity detection, example 1 


Measure 


First component 


Second component 


Projection score 


10 (10-10) 


10 (9-10) 


Gap statistic 


10 (9-10) 


10 (8-10) 


SPC 


18.5 (10-46) 


21 (11-32) 


SSVD 


11 (10-13) 


10.5 (1-12) 



The median and range (across 10 instances) of the number of non-zero 
elements included in the optimal variable subset found by different methods. 
Each underlying component contains 10 non-zero elements. SPC - sparse PCA 
[22]. SSVD - sparse SVD [27]. The gap statistic is defined as in [25,26]. 



beneficial. However, as will be shown (Tables 1 and 2, 
see the section "Detecting sparsity in principal compo- 
nents" below), the gap statistic tends to underestimate 
the number of variables in the optimal subset (this was 
also noted in [26]), which may potentially be the case 
also in this example. This would then lead to a num- 
ber of "false negatives", that is, variables falsely 
excluded from the optimal combination. 

Validation on synthetic data 

In this section, we will validate the projection score by 
using it to find informative variable subsets from differ- 
ent filtering processes applied to synthetic data sets. 
Variance filtering 
We let 

_ | -0.5 if 1 < j < 50, 
Mj " {+0.5 if 51 < j < 100. 
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Table 2 Sparsity detection, example 2 



Measure 


First component 


Second component 


Projection score 


160 (156-160) 


152 (133-156) 


Gap statistic 


122.5 (106-133) 


78 (60-100) 


SPC 


260.5 (229-499) 


311.5 (280-494) 


SSVD 


162.5 (161-339) 


164 (162-313) 



40 samples which are divided into four consecutive 
groups of 10 samples each, denoted a, b, c, d. We define 



The median and range (across 10 instances) of the number of non-zero 
elements included in the optimal variable subset found by different methods. 
Each underlying component contains 160 non-zero elements. SPC - sparse 
PCA [22]. SSVD - sparse SVD [27]. The gap statistic is defined as in [25,26]. 



and generate a synthetic data set with 1,000 variables 
and 100 samples by letting 

Af{lij,crl) if 1 < x < 150; 

l<j<100 
A/"(0,0.5) if 151 < i < 1,000; 
1 < j < 100. 

The only non-random structure in the data is contained 
in the first 150 variables, discriminating between two 
groups of 50 samples each. By varying 0"i we obtain data 
sets with difference variance properties. With o x = 0.5, the 
informative variables and the non-informative variables 
have comparable variances. With Oi = 0.2, the informative 
variables obtain a lower variance than the non-informative 
variables and with C\ - 0.8 the informative variables are 
also those with the highest variances. 

We take S = {1}, since no other choice of S is sup- 
ported by any variable subset. This is also consistent 
with the structure of the data matrix. 

Across 20 realizations, the mean number of variables 
in the subset giving the best projection score are 710.2 
for a x = 0.5 (standard deviation 143.1), 999.9 for d = 
0.2 (standard deviation 0.30) and 118.3 for c Y = 0.8 
(standard deviation 15.0). The projection score correctly 
indicates that when 0"i = 0.2, the informative structure 
in the data is indeed related to the variables with lowest 
variances, and hence all variables are included in the 
most informative subset (that is, no variance filtering). 
Note that the association between the variables within 
each sample group is very strong when <Ji - 0.2. If the 
variables with lowest variance had been routinely filtered 
out in this example, we would lose the informativeness 
in the data. It can also be noted that when the number 
of variables is below a certain threshold (approximately 
850) in the o x = 0.2 case, not even S = {1} is supported 
by the data since we have filtered out all the informative 
variables. For C\ - 0.5, the optimal number of variables 
is highly dependent on the specific data set since the fil- 
tering removes both informative and non-informative 
variables uniformly. 
Response-related filtering 

In this example, we simulate a data matrix containing 
two group structures (see [35]). The data set consists of 



Ml; 
M2j 



-2ifj e a Ub, 
+2 if 7 ecUd. 

— 1 if j e a Uc, 
+ 1 if 7 e bud. 



The data matrix is then generated by letting 

A/X/xij, 1) if 1 < i < 200; 

1 < j < 40 
A/"(/x 2 j, 1) if 201 < i < 250; 

1 < j < 40 
A/"(0, 1) if 251 < i < 1,000; 
1 < j < 40. 

We perform a two-sided £-test contrasting a D c and b 
U d and order the variables by the absolute value of their 
^-statistic for this contrast. In this case, since we compare 
only two groups we are essentially searching for a one- 
dimensional separation, so we choose S = {1}. Figure 6(a) 
shows the structure of the data set. The data set contains 
one very strong factor, encoded by the first 200 variables, 
and one weaker factor, the one we are interested in, 
which is related to the next 50 variables. Figures 6(b) and 
6(c) show the projection score for different thresholds on 
the significance level a and for different variable subset 
cardinalities, respectively. The optimal projection score 
(approximately 0.33) is obtained for variable subsets con- 
taining slightly less than 50 variables (mean value across 
20 simulations of 38.0, standard deviation 4.6, range 
30-46). Hence, even though there is indeed much infor- 
mation contained in the entire variable set, it is possible 
to obtain an even more informative variable subset by fil- 
tering. Projecting onto the first principal component 
based only on the variables in the most informative sub- 
set clearly discriminates between a D c and b U d, as 
shown in Figure 6(d). As above, we can compare the 
maximal projection score corresponding to filtering by 
the association with different responses. Filtering with 
respect to the association with the contrast between a U 
b and c U d, that is, the stronger factor in the data set, 
gives a maximal projection score around 0.60. Hence, 
this correctly suggests that this factor is more informative 
than the previously studied. Filtering with respect to the 
variance, still using S = {1}, gives a maximal projection 
score of 0.68, obtained for approximately 200 variables 
(the variables related to the first factor in the data). This 
implies that the variables with high variance in this case 
are even more informative than the variables closely asso- 
ciated with one of the responses, in the sense that the 
encoded information deviates more from what would be 
expected from the highly varying variables in a rando- 
mized data set. Applying variance filtering with S = {1, 2} 
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Number of included variables 



(0 (d) 
Figure 6 Response-related filtering of a synthetic data set. (a) The structure of the data set. The columns represent samples and the rows 
represent variables, (b) The projection score as a function of log 10 (a), where a is the p-value threshold used as inclusion criterion, (c) The 
projection score as a function of the cardinality of the variable subset, (d) The projection of the samples onto the first principal component from 
the most informative variable subset, consisting of 40 variables. 



provides a most informative variable subset containing 
222 variables, capturing parts of both the two informative 
variable groups in the data (note that the variances of the 
variables in the second, smaller group are not all larger 
than the variances of the non-informative variables). S = 
{1, 2, 3} is not supported by any variable subset. 
Detecting sparsity in principal components 
In this example, we will evaluate the usefulness of the pro- 
jection score for detection of sparsity in the leading princi- 
pal components of a data matrix. We generate a data set, 
with p - 500 variables and N = 50 samples, following the 
scheme given in [21]. Briefly, we form a matrix V = [vi, 
Y p ] from an orthonormal set of p principal components Vi, 
y p g R p , and a matrix C = diag[c\, c p ) containing the 
eigenvalues in decreasing order on the diagonal. Then, we 
form the covariance matrix E by 

P 

2 = J2 CiYiY i = VCVT 

1=1 



To generate data, we let Z be a random draw from 
Af(0,I p ) and take X = VC 1/2 Z. The first two principal 
components (vi and v 2 ) are constructed to have specific 
sparsity patterns. In the first example, we let Vi and v 2 
contain 10 non-zero elements each (all of equal magni- 
tude and selected in such a way that the two sets of non- 
zero elements are non-overlapping) and choose c x - 30, 
c 2 = 16. We let c k = 1 for k = 3, 500 and sample the 
entries of the corresponding principal components uni- 
formly between 0 and 1 before orthogonalizing and nor- 
malizing them. In the second example, we let Vi and v 2 
instead contain 160 non-zero (equal) elements. We 
choose Ci = 400 and c 2 = 200 and proceed as in the pre- 
vious example to obtain the rest of the components. 

The variable subsets are obtained by the shaving pro- 
cedure, as outlined above and described in [25]. This 
gives a nested sequence of variable subsets, which can 
be evaluated in terms of their projection scores. We also 
compute the gap statistic as defined in [25] as another 
quality measure of each variable subset. The optimal 
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variable subsets obtained from these methods are com- 
pared to those found by a sparse PCA algorithm [22] 
and the sparse SVD proposed in [27]. It should be 
noted that the aims of these methods are somewhat dif- 
ferent. The sparse PCA attempts to find a good approxi- 
mation of the original data set using only a subset of the 
variables, while the sparse SVD was developed for 
biclustering, that is, finding small groups of variables 
which are related across possibly only a subset of the 
samples. Hence, also the sample representation can be 
sparse (that is, only describing some of the samples) in 
the results from the sparse SVD. In the examples stu- 
died here, however, the sparsity patterns of the principal 
components are designed to be identical across all sam- 
ples. The sparse SVD and the sparse PCA were applied 
through the R programs provided by the respective 
authors. The sparse SVD was applied with y u = y v = 2, 
as suggested by the authors. The sparsity parameter in 
the sparse PCA was estimated via the cross-validation 
function provided with the R package, evaluating 100 
different sparsity levels between 1 and ^/500 via 5-fold 
cross-validation. 

Table 1 and Table 2 give the median and the range of 
the number of non-zero entries in the first two principal 
components across 10 instances of each example. To 
obtain the second component, we orthogonalized the 
observed data matrix with respect to the first component 
and extracted the first component from the orthogona- 
lized matrix. The results in Table 1 and Table 2 suggest 
that the projection score is able to detect the underlying 
sparsity structure of the principal components. The gap 
statistic tends to underestimate the number of variables 
in the optimal variable subset, as was also noted in [26]. 
The sparse PCA consistently overestimates the number 
of non-zero entries of the components. The sparse SVD 
performs well in many cases, but the variability is larger 
than for the projection score-based method. 

Estimating £ Vx [(a 2 (A x , S)) 1/2 j 

To obtain E-p x ^(a 2 (A x , S))^ 2 j, we repeatedly permute 

the values in each row of X and perform the variable filter- 
ing, which is computationally expensive (see Methods sec- 
tion). A more efficient implementation can be obtained if 
we specify the distribution V^x) for each m = 1, M in 

advance. Then, the values of E-p 0m(x) [(a 2 (A 0m ( X ), S)) 1/2 ] 

can be estimated in advance and stored, which leaves only 
the calculations for the observed data matrix and the sub- 
traction of a known quantity for each variable subset. For 
instance, we can fix V(p m {x) by assuming that each element 
in (p m (X) is drawn from a standard normal distribution, 
that is, 



(0m(X))ij 6^(0,1) 

for i = 1, \R m \ and ; = 1, N. We can then calcu- 
late the expected value of (a 2 (A0 m (x), S)) 1 / 2 for a large 
collection of choices of variable subset cardinalities and 
sample sizes. Figure 7 shows an interpolated surface for 
10 < \R m \ < 2, 000 and 10 < N < 100. This more com- 
putationally efficient approach may be used for example 
for variance filtering, if the observed data matrix is stan- 
dardized before the principal components are extracted. 

When the variable subsets are defined by response- 
related filtering, it is more difficult to specify and sample 
from the truncated distribution resulting from the filter- 
ing. In particular, we note that even for a data matrix X 
containing only independent variables, we still expect to 
see an aggregation of the variance in the first principal 
components when we consider small submatrices (p m (X). 
In some cases, however, it may still be of interest to 
compare the singular values from the observed data to 
those from matrices with a given, known distribution 
such as the one described above. We next give a small 
example to show the different conclusions that can 
result if different definitions of V^ m (x) are used. 
Example 

We define X e R 1 ' 000 x 40 by letting x {] e A/"(0, 1) for i = 
1, 1, 000; ; = 1, 40. We divide the samples into 
four consecutive groups a, b, c, d of 10 samples each, as 
in the response-related filtering above, and filter the 
variable set based on the absolute value of the ^-statistic 
contrasting groups aUc and bud, using S - {1}. If 'p0 m (x) 
is defined by assuming that each element of (p m (X) is 
drawn from a standard normal distribution, the most 
informative variable subset contains 11 variables. Here, 
even though we study a completely random matrix and 
an artificial grouping of the samples, we obtain a good 
separation between the groups. However, the projection 
score is not very high in this case (r = 0.119). Figure 8 
shows the projection score as a function of the signifi- 
cance threshold, and the optimal projection. As can be 
seen, just looking at the visualization in the right panel 
we might be led to believe that there is actually some 
non-random structure in the data. On the other hand, if 
we define ^ m (x) as we have done in the previous exam- 
ples, by assuming independence among the samples and 
variables of the original matrix X and then filtering, no 
submatrix supports even S = {1}, and hence we get an 
indication that we are filtering a matrix without non- 
random relationships between the variables. 

Conclusions 

In this paper, we have introduced and shown the useful- 
ness of the projection score, a measure of the informa- 
tiveness of a subset of a given variable set, based on the 
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Figure 7 Estimation of E-p 0 ^ (x) [(a 2 (A0 m ( X ),S)) 1/2 j The figure shows E P<MX) [(a 2 (A 0m ( X ), S)) 1/2 j with 5 = {1, 2}, for different 
choices of \R m \ and N, when the null distribution 7^ m (x) is defined by assuming that the matrix elements are drawn independently from 
A/"(0, 1} Red values correspond to high values of E-p 0m(x) ^((^(A^pg, S))^ 2 j, while blue values correspond to low values of this 
statistic. 



variance contained in the corresponding sample config- 
uration obtained by PCA. The definition of the projec- 
tion score is straightforward and the interpretation in 
terms of captured variance is intuitively appealing in a 
PCA visualization context. Moreover, the projection 
score allows a unified treatment of variable selection by 
filtering in the context of visualization, and we have 
shown that it indeed gives relevant results for three 



different filtering procedures, both for microarray data 
and for synthetic data. By filtering with respect to a spe- 
cific factor, we obtain sparse principal components 
where all variables receiving a non-zero weight are 
indeed strongly related to the chosen factor. In this 
respect, the resulting components may be more easily 
interpretable than general sparse principal components, 
where the variables obtaining a non-zero weight can be 
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(a) (b) 

Figure 8 The effect of the choice of null distribution, (a) The projection score as a function of log 10 (a), where a is the p-value threshold 

obtained by applying a two-sided t-test for comparing two artificial sample groups, when 7^0 m (x) is defined by assuming all matrix elements 

are normally distributed, (b) The projection obtained by applying PCA to the most informative variable subset, containing 1 1 variables. 
\ J 



related to many different factors. The optimal number 
of variables included in the extracted principal compo- 
nents often increases with |5| and in many cases small 
variable subsets do not even support large subsets S. 

Methods 

The projection score 
Definition and basic properties 

Let X = [xi, x N ] g R pxN be a given matrix with rank 
r, containing N samples of p random variables. Principal 
Component Analysis (PCA) [1-4] reduces the dimen- 
sionality of the data by projecting the samples onto a 
few uncorrelated latent variables encoding as much as 
possible of the variance in the original data. Assuming 
that each variable is mean-centered across the samples, 
the empirical covariance matrix (scaled by N - 1) is 
given by XX r . The covariance matrix is positive semi- 
definite with rank r, so by the spectral theorem we have 
a decomposition 

XX T V = VA|. 

Here V = [vi, v r ] is a p x r matrix such that V r V = 
I r , where I r is the r x r identity matrix, and A x = diag 
(Ai(X), A r (X)) is the r x r diagonal matrix having the 
positive square root of the non-zero eigenvalues of XX r 
(that is, the singular values of X) in decreasing order 
along the diagonal. 

The orthonormal columns of V contain the weights of 
the principal components, and the coordinates of the 
samples in this basis are given by U = X r V. We obtain 
a lower-dimensional sample configuration by selecting 
the columns of U with index in a specified subset S <= 
{1, r}. The rows of this matrix then provide a repre- 
sentation of the samples in an \S\ -dimensional space. In 
this paper the aim is to find particularly instructive such 



sample configurations for a given choice of S, by includ- 
ing only a subset of the original variables with non-zero 
weights in the principal components. 

The first principal component is the linear combina- 
tion of the original variables which has the largest var- 
iance, and the variance is given by A^(X). Similarly, the 
second principal component has the largest variance 
among linear combinations which are uncorrelated with 
the first component. Given a subset S <= {1, r }, the 
fraction of the total variance encoded by the principal 
components with index in S is consequently 

This interpretation implies that a 2 can be used as a 
measure of the amount of information captured in the 
corresponding \S\ -dimensional sample configuration. In 
some applications, it is fairly straight-forward to select a 
suitable subset S. The synthetic example above, where 
we filter the variables by their association to a response 
variable dividing the samples into two groups, is such a 
case, where we expect a one-dimensional signal and 
hence select S = {1}. In other applications, we may select 
S = {1, 2} or S = {1, 2, 3} in order to make visualization 
possible. It is also possible to try several different 
choices of S for the same data set, in order to possibly 
detect different patterns. A specific choice of S effec- 
tively indicates which part of the data is to be consid- 
ered as the "signal" of interest, and the rest is in some 
sense considered irrelevant. For a given S the expected 
value of a 2 depends heavily on the size and the underly- 
ing distribution of the matrix X. This should be taken 
into account in order to obtain a reasonable measure of 
the informativeness of X. We introduce the projection 
score as follows: 
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Definition 1 (Projection score). Let X e R p be a 
matrix with rank r. For a given matrix probability distri- 
bution Vx and a subset S <= {1, r} we define the pro- 
jection score r(X, S, Vx) by 

r(X,S,Px) 

= (cx 2 (A x ,S)) 1/2 - £ Vx [(a 2 (Ax f S)^ 2 ] , 

where A x = (A^X), A r (X)) is the length-r vector 
containing the non-zero singular values of X in decreas- 
ing order. 

While much work is going on in the field of random 
matrix theory and eigenvalue distributions for random 
matrices (see e.g. [36-38]), most results are asymptotic 
and only valid under certain distributional assumptions 
on the matrix elements. Since we do not expect such 
assumptions to hold true in general for submatrices 
obtained by variable selection procedures, it is often 
necessary to use randomization methods to estimate 

The relationship between the projection score and the 
quality measure from the gene shaving algorithm 

As described in the previous section, the projection 
score is a measure of the informativeness of a variable 
subset. Let X e R kxN denote the submatrix correspond- 
ing to a certain variable subset. Letting X t j denote the 
value of the f th variable in the subset for the /th sample 
and assuming S = {1}, the informativeness measure used 
by the projection score is given by 

Tj^f=l fell P&ij) 
a 2 = r r T 

where /3 = (fi lf p k ) is the first principal component of 
X. The expression in the denominator is the variance of 
the variable subset, and the numerator is the variance of 
a weighted average of the variables in the subset, where 
each variable is weighted by the corresponding element 
of the first principal component. In [25,26], the authors 
use another quality measure of a variable subset, and 
apply it to subsets generated by a shaving procedure. The 
quality measure proposed in [25,26] is given by 

p2 Ag,((a,i*H)' 

m Si=i Ej=i i x ij - x ) 

where x denotes the overall mean value of X. To 
allow both positively and negatively correlated variables 
in the same cluster, the values are multiplied by the sign 
of the corresponding element of the first principal com- 
ponent of the variable subset before computing R 2 . This 



quality measure is similar to our informativeness mea- 
sure, but all variables are weighted equally, by Ilk. 
Hence, in this case the Euclidean norm of the weight 
vector depends on /c, while for the weight vector in the 
projection score we have H^H^ = 1. As noted in [26], the 
R 2 measure is biased towards small cluster sizes. 

How to use the projection score 

In this section, we will outline how the projection score 
can be used to compare the informativeness of M differ- 
ent submatrices of a given matrix X. For a given collec- 
tion of variable subsets R m , m = 1, M, we define 
functions 

for m = 1, M by letting (p m {X) be a submatrix of X 
containing only the rows corresponding to the variables 
in R m . For filtering purposes, the number of variables to 
include in each submatrix (|i? m |) can be determined by 
setting threshold levels on an underlying statistic in the 
observed matrix X. For example, one can successively 
include all variables with variances greater than 1%, 2%, 
... of the maximal variance among all variables. Note 
that this is only one example of how the variable subsets 
can be defined, and that there are many other possibili- 
ties. For example, we could successively include the 100, 
200, . . . variables with the largest variances. Given a 
null distribution V^x) for each <p m (X), we can calculate 
the projection score r(0 m (X), S, 7^ m (x)) for m = 1, M. 
For a fixed «S, a subset R m is then said to be more infor- 
mative than a subset R n if 

r(0 m (X),S,P 0m( x)) > r(0 n (X),S,7> 0n( x)). 

Note that the same S should be used for both subsets. 
In general, it is very difficult to compare projection 
scores obtained for different choices of S. 

The null distribution V^ m (x)> for matrices of dimension 
\R m \ x N, can be defined in different ways. One particu- 
larly simple way is to assume that every matrix element 
is drawn independently from a given probability distri- 
bution, e.g. Xij G A/"(0, 1) for i = 1, \R m \ and ; = 1, 
Af. Then, the optimal projection score is obtained for 
the submatrix whose singular values deviate most from 
what would be expected if all matrix elements were 
independent standard normally distributed variables. 
However, even if the original data set consists of inde- 
pendent normally distributed variables, this is not in 
general true after applying (p m . This means that even a 
submatrix obtained by filtering independent normally 
distributed variables may be far from the null distribu- 
tion defined this way. 

In the applications in this paper, we define Vx by per- 
mutation of the observed data, assuming that X consists 
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of N independent samples of p independent variables. 
We then define the null distribution of each variable in 
<p m (X) by truncation of the corresponding distribution 
obtained from 7\, with respect to the features of (p m . 
For real microarray data sets, which often contain a 
considerable amount of non-random variation, the vari- 
ables are generally far from independent. Therefore we 
expect that the variance encoded by the leading princi- 
pal components of the real data is larger than for the 
data generated under the null model, and the basic idea 
behind the projection score is to use the difference 
between them as an informativeness measure. 

In practice, we generate B data matrices X* b , b = 1, 
B from Vx by permuting the values in each row of X 
independently. For each X* b we compute a2{A x *b,S), 
and the expected value of (a 2 (A x , S)) in under the prob- 
ability distribution V x is then estimated as 



1 B 

£ Vx [{a 2 {A Xf S))^ 2 ] = - J> 2 (A X *>,S)) 

b=l 



1/2 



Similarly, E^ m(x) [(a 2 (A 0in( x), S)) 1/2 ] is estimated by 

repeated permutation of the values in each row of X, 
followed by application of cp m to the permuted matrix. 
Hence, 



E nm(x) [(a 2 (A 0l „ (X ),S)) 1/2 ] 

4E(« 2 ( A 0„(X-)'S)) 1/2 . 



b=l 

For each b, the variable subsets R m are re-defined in 
such a way that each <j9 m (X*^) contains the same number 
of variables as <p m (X). In our examples, we permute the 
data matrices B = 100 times. 

When the number of variables is decreased by filter- 
ing, the true dimensionality of the resulting data set 
(that is, the number of non-trivial principal compo- 
nents) may change. We say that a submatrix (p m (X) sup- 
ports a given S if the variance accounted for by each of 
the principal components of (p m (X) with index in S is 
large enough. More specifically, we estimate the distri- 
bution of A^(0 m (X)) for each h 5 under the probabil- 
ity distribution ^0 m (x> If the estimated probability of 
obtaining a value of A^(0 m (X)) at least as large as the 
observed value is less than 5% for all k e S, we say that 
the submatrix <p m (X) supports S. In practice, the null 
distribution of A^(0 m (X)) is estimated from the singular 
values of the submatrices <p m (X*^). Permutation methods 
similar to this approach, comparing some function of 
the singular values between the observed and permuted 
data, have been used and validated in several studies to 
determine the number of components to retain in PCA 
[4,31,39]. 



Generalizations of the projection score 

In this section we will indicate some possible generaliza- 
tions of the projection score defined above. 
Generalized measure of information 

We noted above that cc 2 (A x , S) is a natural measure of 
the information contained in the principal components 
of X with index in S. More generally, we can use any £ q 
norm (q > 1) instead of the £ 2 norm to measure infor- 
mation content, giving a measure of the explained frac- 
tion of the information content as 



^(A x , S) 



By replacing (a 2 (A x , S)) in in the definition of the pro- 
jection score with, for example, a function of the form 

g =ho af 

where h : R — > R is an increasing function, we obtain 
a generalized projection score as 

T S (X,S,V X ) 

= S(Ax,S)-E Px [s(A x ,S)]. 

An alternative generalization is obtained by normaliz- 
ing by taking the quotient of the observed and expected 
value of g{A x , S) instead of the difference, that is, defin- 
ing the generalized projection score as 



<T g {X,S,Vx) 



g(Ax,S) 
E Px [g(A x ,S)]- 



Supervised projection score 

The projection score evaluates only the informativeness 
of the variable subsets with respect to visualization by 
PCA. We can make the variable subset selection (par- 
tially) supervised by incorporating a term quantifying 
the association of the variable subset with a response 
variable into the projection score. The term can be, for 
example, the classification ability of the variable subset 
(estimated by cross-validation) or the correlation 
between an aggregate of the variables and a quantitative 
response. 
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